Automatic term extraction plays an essential role in domain language understanding and in several downstream natural language processing tasks. In this paper, we present a comparative study of the predictive power of Transformer-based pretrained language models for term extraction in a multilingual cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models, taking the intersection or union of the term sets output by different language models. Our experiments were conducted on the ACTER corpus, covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the Slovenian RSDO5 corpus, covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the monolingual-model strategy outperforms the state-of-the-art multilingual approaches from related work for all languages except Dutch and French when the task excludes the extraction of named entity terms. Furthermore, combining the outputs of the two best-performing models yields significant further improvements.
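As a concrete illustration of the ensembling step described above, the sketch below combines the term sets produced by two extraction models via set union (recall-oriented) or intersection (precision-oriented). The extractor outputs shown are hypothetical placeholders, not the paper's actual model predictions.

```python
def ensemble_terms(terms_a, terms_b, mode="union"):
    """Combine two candidate-term sets.

    union        -> favors recall (any term proposed by either model)
    intersection -> favors precision (only terms both models agree on)
    """
    a, b = set(terms_a), set(terms_b)
    if mode == "union":
        return a | b
    if mode == "intersection":
        return a & b
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical outputs of a monolingual and a multilingual extractor:
mono = {"wind turbine", "rotor blade", "nacelle"}
multi = {"wind turbine", "nacelle", "feed-in tariff"}

print(ensemble_terms(mono, multi, "union"))         # recall-oriented ensemble
print(ensemble_terms(mono, multi, "intersection"))  # precision-oriented ensemble
```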
Named entity recognition (NER) is an information extraction technique that aims to locate named entities in a document and classify them into predefined categories (e.g., organizations, locations, ...). Correctly identifying these phrases plays an important role in simplifying access to information. However, it remains a difficult task, because named entities (NEs) take many forms and are context-dependent. While context can be represented by contextual features, such models often misinterpret global relations. In this paper, we propose combining contextual features from XLNet with global features from a graph convolutional network (GCN) to enhance NER performance. Experiments on a widely used dataset, CoNLL-2003, demonstrate the benefits of our strategy, with results competitive with the state of the art (SOTA).
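The following is a minimal PyTorch sketch of the fusion idea just described: token-level contextual features (standing in for XLNet output) are concatenated with global features from a graph convolutional layer before per-token tag classification. The tensor shapes, the Kipf-and-Welling-style GCN layer, and the placeholder identity graph are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # adj: row-normalized adjacency matrix over tokens (seq_len x seq_len)
        return torch.relu(self.linear(adj @ x))

class ContextPlusGraphTagger(nn.Module):
    def __init__(self, ctx_dim=768, gcn_dim=128, num_tags=9):
        super().__init__()
        self.gcn = GCNLayer(ctx_dim, gcn_dim)
        self.classifier = nn.Linear(ctx_dim + gcn_dim, num_tags)

    def forward(self, ctx_feats, adj):
        # ctx_feats: (seq_len, ctx_dim) contextual embeddings, e.g. XLNet output
        global_feats = self.gcn(ctx_feats, adj)           # (seq_len, gcn_dim)
        fused = torch.cat([ctx_feats, global_feats], -1)  # feature fusion
        return self.classifier(fused)                     # per-token tag logits

# Toy usage with random stand-ins for the XLNet output and the token graph:
seq_len = 6
ctx = torch.randn(seq_len, 768)
adj = torch.eye(seq_len)  # placeholder graph; real edges would come from the data
logits = ContextPlusGraphTagger()(ctx, adj)
print(logits.shape)  # torch.Size([6, 9]) -> one tag distribution per token
```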
With the growing amount of available textual data, developing algorithms capable of automatically analyzing, categorizing, and summarizing these data has become a necessity. In this study, we present a novel algorithm for keyword identification, i.e., the extraction of one- or multi-word phrases representing the key aspects of a given document, called Transformer-based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language-model pretraining on a domain-specific corpus, the model is able to overcome the deficiencies of both supervised and unsupervised state-of-the-art approaches, offering competitive and robust performance across a variety of datasets while requiring only a fraction of the manually labeled data needed by the best-performing systems. The study also provides a thorough error analysis, with valuable insights into the inner workings of the model, and an ablation study measuring the impact of specific components of the keyword identification workflow on overall performance.
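As an illustration of casting keyword identification as sequence tagging in the spirit of TNT-KID, here is a minimal PyTorch sketch: a transformer encoder over the token sequence followed by a binary per-token head (keyword vs. non-keyword). The hyperparameters and the binary tagging scheme are simplifying assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class KeywordTagger(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)  # keyword / non-keyword per token

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        return self.head(h)                      # (batch, seq, 2)

# Toy usage: tag a batch of two 8-token "documents" of random token ids.
model = KeywordTagger()
ids = torch.randint(0, 10000, (2, 8))
pred = model(ids).argmax(-1)  # 1 marks tokens predicted as part of a keyphrase
print(pred.shape)  # torch.Size([2, 8])
```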